A Survey on Learning Objects' Relationship for Image Captioning

Image captioning is a challenging modality transformation task in computer vision and natural language processing, aiming to understand the image content and describe it in natural language. Recently, the relationship information between objects in the image has been shown to be important for generating more vivid and readable sentences. Much research has been done on mining and learning relationships and leveraging them in caption models. This paper mainly summarizes the methods of relational representation and relational encoding in image captioning. Besides, we discuss the advantages and disadvantages of these methods and provide commonly used datasets for the relational captioning task. Finally, the current problems and challenges in this task are highlighted.


Introduction
Image captioning aims to understand the content of an image and then infer a natural sentence to describe it. The generated description needs to achieve satisfactory accuracy, adequacy, and readability [9,31,32,33]. Readability requires the sentences to satisfy grammatical rules, accuracy requires the content of the generated sentences to conform to the content of the image, and adequacy measures how fully the generated sentences express the image information. The adequacy and accuracy of a sentence depend on whether the visual vocabulary (describing the category and attributes of the objects) and the relational vocabulary (describing the relationships between the objects) are fully reflected and whether they conform to the image's content.
The early captioning methods rely on image-to-text retrieval [1,34] or on filling sentence templates [35][36][37] to improve the adequacy and accuracy of the generated sentences. Technically, they mainly use static object categories and statistical language models. Among the retrieval methods, Aker and Gaizauskas [34] used a dependency model to summarize the information contained in multiple web documents and localize this information to images. Kulkarni et al. [1] used conditional random fields based on the objects detected in the image to predict the image's labels for retrieval. Among the template methods, Li et al. [35] proposed a web-scale n-gram-based method to collect candidate phrases and compose them into sentences. Yang et al. [36] used a language model trained on the English Gigaword corpus to obtain the actions in the image and incorporated them into a hidden Markov model. Lin et al. [37] used a 3D visual analysis system to represent objects, attributes, and relationships in images. They transformed them into a series of semantic trees, from which they learned a grammar and generated sentences.
However, the early captioning methods [1,34,35,36,37] suffer from a few shortcomings. The template-based methods make the generated sentences rigid and less readable, while retrieval can lead to mismatches between images and texts, affecting accuracy or adequacy. With the development of deep learning technology [38][39][40][41][42][43][44][45][46][47][48][49], Vinyals et al. [2] proposed an encoder-decoder model, which uses convolutional neural networks [40] to understand objects and scenes in images and uses LSTM [44,50] to model the long-term dependency between words. Specifically, the generation of each word in a sentence depends on the memory state and the image's global information. Xu et al. [3] incorporated an attention mechanism into the encoder-decoder framework to align text to specific regions in an image. Lu et al. [4] proposed an adaptive attention method that utilizes visual sentinels to align nonvisual vocabulary during sentence generation. In the related multimodal field [51][52][53][54][55][56][57], Ding et al. [58] introduced the attention mechanism to video captioning, so that the model can adaptively focus on the elements, parts, or details in the image when dealing with each frame. Qin et al. [59] considered the visual coherence of the attention region and introduced a memory ability into the attention mechanism. To alleviate the accumulated error in sentence generation, they proposed a new language model that generates sentences chunk by chunk instead of word by word.
Furthermore, to align objects with words more accurately, Anderson et al. [5] adopted an object detection network to detect objects and constructed a two-LSTM decoder to learn the dependencies between words in sentences and the alignment between words and image regions. To enhance the coherence between words and the syntactic paradigm of sentences, Ke et al. [60] proposed a new LSTM variant that considers the previously generated words and their relative positional information during decoding. Ding et al. [61] were inspired by the perception of the human brain and adjusted the attention weight of each object according to its color, bounding-box area, and visual permutations. This perception can also bring great improvement when it is integrated with image captioning models.
In recent years, full-attentive models [9,14,18,62,63] have developed rapidly. Vaswani et al. [64] proposed the Transformer, which uses attention to learn intermodal and intramodal interactions, and it has achieved excellent results in natural language processing tasks such as machine translation. Zhu et al. [6] applied the transformer to image captioning and confirmed its effectiveness in the captioning task. The transformer learns the interrelationships between object attribute features in visual sequences through the encoder and utilizes attention in the decoder to align text features with visual features. With the object features [38,65] provided by the pretrained object detection network [38,43,65], the accuracy and adequacy of the visual vocabulary generation are significantly improved by the reinforcement learning strategy [12,66]. On the other hand, BERT-based vision-language pretraining methods [67,68] concentrate on designing a unified framework for multiple vision-language tasks, which first optimizes the objects' features with specific pretraining objectives and then generates sentences after fine-tuning the features with the caption objective. Those methods have achieved a new level of performance in image captioning. Furthermore, Li et al. [69] designed a decoupled encoder-decoder framework with a scheduled sampling strategy to counter the incompatibility between vision-language understanding and caption generation. Recently, Li et al. [70] used a cross-modal retrieval technique to generate a primary sentence and refine its content with transformer blocks, which greatly improved the model performance in the end-to-end training mode. To facilitate the development of captioning models, a unified codebase [71] has been proposed, which covers many high-performance modules for each stage of cross-modal analytics between vision and language in the multimedia field.
Since 2019, some studies [62,63,72,73,74] have begun to characterize the relationships between objects, building on the abovementioned works, to improve the generation of relational vocabulary. To model objects' relationships, researchers first start from basic spatial relationships to explicitly perceive relational information and establish alignment with relational words. Then, they go a step further to mine the higher-level semantic relationships hidden in the image. In this process, low-level geometric spatial features are easier to construct, but they are also less capable of representing the complex relationship categories of the textual modality. The relationship between objects can be reflected by multiple relationship categories with similar meanings, which makes it multirelational data. For multirelational data in images, finding higher-level relational features is a difficult challenge. After feature construction, how to effectively combine relational features in the feature optimization stage, so that the optimized features are well separable across different relational categories, is a problem worth studying. To follow the development of relational image captioning, it is necessary to review the previous works on relationships and help subsequent researchers improve the intelligence of captioning models. This paper mainly classifies and summarizes the extraction methods for this relational information and the corresponding encoding methods in current image captioning. Following the framework shown in Figure 1, we review the main line of relational captioning and summarize a taxonomy of relational methods. Meanwhile, the commonly used datasets and evaluation measures are presented in this paper. The advantages and disadvantages of the methods and future development prospects are analyzed.

Contributions.
Our contributions in this paper are as follows: (1) Combining previous studies in relational image captioning, we summarize a taxonomy of relational information processing in the image, which includes feature construction and encoding. Meanwhile, we introduce the corresponding methods and analyze their strengths and weaknesses. (2) We review the relevant datasets involved in relational image captioning, covering relational understanding and image captioning datasets. The metrics used in evaluation are also recorded in this paper. This paper is organized as follows: the second section briefly introduces the visual branch in relational captioning, mainly the basic knowledge and overall framework commonly used in relational image description. The third section describes the construction of relational features in images. The fourth section describes the encoding of relational information. The fifth section describes the datasets and related evaluation indicators used to extract and learn relational data in image captioning. The sixth section concludes and presents the prospects for future development in this field.

Backbone
The backbone of relational captioning is the standard encoder-decoder framework [2][3][4], the same as in the common captioning task. It is independent of the relationship modeling but needs to be discussed to present the whole procedure. As shown in Figure 1, the backbone consists of two parts: an encoder and a decoder. Given an image $I$, relational captioning begins with the objects detected by the object detector [38]. The encoder refines each element in the visual sequence and feeds it into the decoder to generate a natural sentence.

Full-Attentive Encoder.
Initializing from the visual sequence $V = \{v_1, v_2, \ldots, v_n\}$, the purpose of the encoder is to enrich each object's feature. Recently, transformer-dominated full-attentive models [2] play an important role in relational captioning. The most important component of the transformer is the scaled dot-product attention operator, whose structure is shown in Figure 2(a). Its calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{E}{\sqrt{d}}\right) V, \qquad E = QK^{T}.$$

It calculates the similarity between each query vector $q \in \mathbb{R}^{d}$ in the query matrix $Q \in \mathbb{R}^{N \times d}$ and each key vector $k \in \mathbb{R}^{d}$ in the key matrix $K$, which gives the attention weight $E = QK^{T}$. After normalization, $E$ is multiplied with $V$ so that each output vector is a weighted sum of the elements of $V$, each scaled by its corresponding weight in the weight matrix. Meanwhile, to further enhance the representation ability of the attention operator [64] and speed up the convergence of the model during training, the multihead attention mechanism [64] is combined with the conventional attention operator, as shown in Figure 2(b). Its formula is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$$

where $i$ is the index of each head and the matrices $W$ are learnable projections. Each head is a segment of the original feature space. The dimension of each subspace is $d/h$, where $h$ is the total number of heads. The multihead attention mechanism performs the self-attention calculation in each subspace and then fuses the outputs of all subspaces with Concat. After passing through the encoder, the optimized sequence of object features is fed into a subsequent decoder to generate sentences.
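The attention operator and its multihead extension described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under our own assumptions: the function names, the single-sample (unbatched) layout, and the way heads are obtained by slicing one projected matrix into $h$ subspaces are illustrative choices, not the implementation of any cited model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (N, d). Returns (output, weights)."""
    d = Q.shape[-1]
    E = Q @ K.T                                   # (N, N) pairwise similarities
    weights = softmax(E / np.sqrt(d), axis=-1)    # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Self-attention over X (N, d) with h heads of size d / h each."""
    N, d = X.shape
    assert d % h == 0, "feature size must be divisible by the number of heads"
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):
        s = slice(i * d // h, (i + 1) * d // h)   # i-th subspace
        out, _ = scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s])
        heads.append(out)
    # Concat fuses the per-head outputs before the output projection.
    return np.concatenate(heads, axis=-1) @ W_o
```

Slicing a single projected matrix into $h$ column blocks is equivalent to applying $h$ separate per-head projection matrices, which is why the sketch needs only one weight matrix per role.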

LSTM-Based Decoder.
Decoders for relational captioning are various language models, commonly LSTM [44], transformer, and their variants. We denote the output of the encoder as $X = \{x_1, x_2, \ldots, x_n\}$. Given $X$, Anderson et al. [5] build a decoder with two LSTMs: an attention LSTM and a language LSTM. The attention LSTM takes the word embedding vector $w_{t-1}$, the hidden state $h^{l}_{t-1}$ of the language LSTM at the previous time step, and the global visual feature $g$ (the average of all object features) as input to calculate the hidden state $h^{a}_{t}$ at the current time step.
As an attention query, $h^{a}_{t}$ is used to compute the attention score $\alpha^{c}_{t,i}$ with each element of $X$. $\alpha^{c}_{t}$ is the context attention weight for fusing $X$ into a context vector. The language LSTM takes the current hidden state $h^{a}_{t}$ of the attention LSTM and the context vector to generate the current word representation $w_t$.

Computational Intelligence and Neuroscience

Reflective Decoder.
In the word-by-word decoding process, modeling the previously generated content and the positional information of each word is beneficial for generating the word at the current time step. Ke et al. [60] enhance the LSTM-based decoder with reflective attention and reflective position modules. In the LSTM-based decoder, the output $h^{l}_{t}$ of the language LSTM is followed by a linear function for generating the current word. The reflective attention module replaces $h^{l}_{t}$ with an attended result $\hat{h}^{l}_{t}$ computed over the previously generated content:

$$\hat{h}^{l}_{t} = \sum_{i} \alpha^{ref}_{i,t}\, h^{l}_{i},$$

where $\alpha^{ref}_{i,t}$ is the attention weight corresponding to $h^{l}_{i}$ at the $i$-th time step. Besides, $\hat{h}^{l}_{t}$ is constrained by the relative position of each word in the sentence through a loss function that minimizes the distance between the position predicted from $\hat{h}^{l}_{t}$ and $t/n$, where $t$ is the time step of the word and $n$ is the length of the sentence.

LSTM-Based Decoder for Graph.
To introduce the graph structure into the language decoder, Chen et al. [74] proposed a variant of the conventional two-LSTM decoder that consists of two modules: a graph-based attention mechanism and a graph update mechanism. The graph-based attention mechanism computes two attention weights: $\alpha^{c}_{t}$ and $\alpha^{f}_{t}$. $\alpha^{c}_{t}$ is the context attention weight, which follows the two-LSTM decoder. $\alpha^{f}_{t}$ is the flow attention weight, which constrains the model to attend to a semantically relevant node among the neighbors of the previously attended one. Specifically, it is a soft interpolation of three flow scores with a dynamic gate. According to the number of moving steps, the three flow scores are computed with the adjacency matrix $M_f$: (1) stay at the same node, $\alpha^{f}_{t,0} = \alpha_{t-1}$; (2) move one step, $\alpha^{f}_{t,1} = M_f \alpha_{t-1}$; and (3) move two steps, $\alpha^{f}_{t,2} = (M_f)^{2} \alpha_{t-1}$. The flow attention is computed as follows:

$$\alpha^{f}_{t} = \sum_{k=0}^{2} s_{t,k}\, \alpha^{f}_{t,k},$$

where $s_{t,k}$ denotes the weight that the dynamic gate assigns to the $k$-step flow score. The final attention weight $\alpha_t$ balances $\alpha^{c}_{t}$ and $\alpha^{f}_{t}$ with a gate function. To avoid repetition and omission in the attention process, Chen et al. [74] use a graph update mechanism to dynamically remove or preserve nodes with a visual sentinel $u_t$.
The scalar $u_{t,i}$ indicates whether the generated word expresses the attended node. To avoid repetition, an erase gate $e_{t,i}$ for the $i$-th node is computed according to its visual sentinel $u_{t,i}$. Meanwhile, if a node needs to be accessed multiple times, an add gate $a_{t,i}$ for the $i$-th node is also computed to preserve its status.
In these computations, $f_{*}$ are fully connected networks, and $\theta_{*}$, $W_{*}$, and $w_{*}$ are the learnable parameters.
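The three flow scores and their gated interpolation can be illustrated with a small NumPy sketch. The assumptions here are ours and are not taken from [74]: the adjacency matrix is oriented so that `M_f[j, i]` is nonzero when node `i` points to node `j`, and the gate is passed in as a ready-made vector of three nonnegative weights summing to 1 (in the model it would be predicted by a learned network).

```python
import numpy as np

def flow_attention(alpha_prev, M_f, gate):
    """Gated interpolation of the stay / one-step / two-step flow scores.

    alpha_prev: (n,) attention over graph nodes at the previous step.
    M_f:        (n, n) flow adjacency; M_f[j, i] != 0 means i -> j (assumed).
    gate:       (3,) weights for 0, 1, and 2 moving steps (assumed given).
    """
    stay = alpha_prev              # (1) stay at the same node
    one_step = M_f @ alpha_prev    # (2) move one step along the edges
    two_step = M_f @ one_step      # (3) move two steps, i.e. (M_f)^2 alpha
    flows = np.stack([stay, one_step, two_step])   # (3, n)
    return gate @ flows                            # (n,)

# Usage on a chain graph 0 -> 1 -> 2, with all previous attention on node 0.
M_f = np.zeros((3, 3))
M_f[1, 0] = 1.0
M_f[2, 1] = 1.0
alpha_f = flow_attention(np.array([1.0, 0.0, 0.0]), M_f,
                         np.array([0.2, 0.5, 0.3]))
```

In this toy chain, the attention mass placed on node 0 is redistributed to nodes 0, 1, and 2 in proportion to the gate weights, which shows how the flow attention follows the order of the graph.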

Transformer Decoder.
The transformer decoder proposed by Vaswani et al. [64], which consists of multiple sublayers, is also widely used in image captioning. In each sublayer, the textual features first learn the interactions within their own modality through self-attention and then align with specific object features through the cross-attention between the textual features and $X$. They finally pass through the fully connected layer to generate the representation $w_t$ of the word at the current time step. $w_t$ is then mapped to the corresponding word through the mapping matrix and the softmax function.
In summary, the overall process of relational image description is to generate sentences through the visual branch, while the relational branch processes the object-level relational features to be integrated into the visual branch. In the visual branch, given an image $I$, the object feature sequence $V$ obtained by object detection is used as input, and $X$ is then obtained by the encoder. The commonly used encoders are mainly transformer encoders or graph convolutional networks [72][73][74][75][76]. Then, $X$ is input to the transformer decoder or the two-LSTM decoder to generate natural sentences word by word.

Relational Branch
The relational branch is the core of relational captioning. It concentrates on the encoder part and incorporates the relationships between objects into the encoder. It includes two steps: (1) feature construction and (2) relational encoding. The relationships in an image can be divided into two categories: (1) position relationships and (2) action relationships, corresponding to positional words and predicate words, respectively. As shown in Figure 3, the position relationship refers to the geometric relationship between objects, which can be expressed as positional words in sentences, such as "in" and "on." On the other hand, the action relationship represents a more complicated and higher-level semantic relationship between the subject and the object. In the textual modality, a predicate generally represents one kind of action relationship, as shown in Figure 3. This section mainly introduces different relational feature construction methods and feature encoding methods according to the different types of relations.

Feature Construction.
The first step in relational captioning is extracting and constructing relational features. Many studies in visual relationship detection and scene understanding have explored the relationships between objects in images. The position relationship represents the up-down and left-right relationship between two objects in the 2-dimensional space. It corresponds to the words describing position in the sentence, such as "on" and "near". The action relationship between objects represents a specific action, which corresponds to a particular predicate verb in the generated sentence. Figure 3 illustrates these two kinds of relationships. In this section, we mainly summarize the current extraction methods for these two kinds of relational information and list the advantages and disadvantages of each technique.

Positional Relationship.
The positional relationship between objects is usually represented by the geometric relationship between two objects' bounding boxes in two-dimensional space. Given an image $I$ and $N$ object boxes in it, the position vector of each object box is represented as $(x_i, y_i, w_i, h_i)$, and the geometric relationship between the object boxes includes the relative distance, relative angle, and relative area between the object boxes. According to the different data structures, the representation methods can be divided into two types: (1) tensor and (2) graph.

Relative Geometric Tensor.
The main idea is to construct an $N \times N \times d$ tensor to represent all $N \times N$ object pairs. Each of these relations is a $d$-dimensional vector. Herdade et al. [62] and Guo et al. [63] used the relative distances between the box centers and the relative size ratios between the objects' boxes to construct geometric vectors:

$$\lambda(i, j) = \left(\log\frac{|x_i - x_j|}{w_i},\ \log\frac{|y_i - y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right). \qquad (8)$$

The subscripts $i$ and $j$ denote the $i$-th and $j$-th objects in the image. The outer logarithm provides numerical stability: when the width and height of object box $i$ are very small, it keeps the output value from drifting too far from the mean, which would otherwise cause excessive variance and make the model difficult to converge. The geometric vectors of all $N \times N$ object pairs form the $N \times N \times 4$ geometric tensor. Meanwhile, the ReLU activation filters the negative elements when two objects' boxes are very close.
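A minimal NumPy sketch of how such an $N \times N \times 4$ geometric tensor could be assembled from box centers and sizes is given below. The function name and the small clamp `eps`, which keeps the logarithm finite when two centers coincide (for example, on the diagonal $i = j$), are our own assumptions rather than details stated in [62] or [63].

```python
import numpy as np

def relative_geometry(boxes, eps=1e-3):
    """boxes: (N, 4) array of (x, y, w, h), with (x, y) the box centre.

    Returns an (N, N, 4) tensor whose entry [i, j] relates box i to box j.
    """
    boxes = np.asarray(boxes, dtype=float)
    x, y, w, h = boxes.T
    # Centre distances normalised by the size of box i (row index).
    dx = np.abs(x[:, None] - x[None, :]) / w[:, None]
    dy = np.abs(y[:, None] - y[None, :]) / h[:, None]
    # Size ratios of box j relative to box i.
    dw = w[None, :] / w[:, None]
    dh = h[None, :] / h[:, None]
    return np.stack(
        [
            np.log(np.maximum(dx, eps)),  # clamp avoids log(0) (assumption)
            np.log(np.maximum(dy, eps)),
            np.log(dw),
            np.log(dh),
        ],
        axis=-1,
    )

# Usage: two boxes given as (centre x, centre y, width, height).
G = relative_geometry([[10, 10, 4, 4], [14, 12, 8, 2]])
```

Because every pair of boxes is covered, the cost of building the tensor grows quadratically with the number of detected objects.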
In summary, the geometric feature mainly describes the relative distance between the center points of the two object boxes and the relative size ratio between the object boxes. It can provide basic prior information about the objects' size and location, which is very helpful for image understanding. However, the geometric features extracted by this method are not enough to represent high-level semantic relationship categories, and they are also affected by the scale of the bounding boxes when representing different spatial orientations. In addition, the number of relationships to be calculated is large, since all object pairs in the image need to be considered. In practical use, constructing a complex network model to learn the geometric feature tensor often brings a high computational cost. To a certain extent, this limits the model's ability to learn the positional relationships between objects.
$$PE(pos, 2k + 1) = \cos\left(\frac{pos}{10000^{2k/(d/2)}}\right),$$

where $i$ and $j$ are the row and column indices of the grid, respectively, and $PE_{*}$ is the position encoding vector of dimension $d/2$. $pos$ is the corresponding position, and $k$ indexes the dimensions. For object features, the coordinates are mapped directly to the feature space through the embedding matrix $W_{emb}$, where $B_i = (x_{min}, y_{min}, x_{max}, y_{max})$ contains the coordinates of the upper-left and lower-right corners of the object bounding box. Absolute geometric features are geometric features tied to fixed image regions, which can effectively improve the spatial separability of features, but they lack flexibility.

Geometric Graph.
The data structure of a graph can naturally use edges to represent the relationships between nodes. Therefore, using a graph to represent the relationships in relational captioning is natural. Specifically, graph-structured data $G = (V, E)$ is composed of the node set $V$ and the edge set $E$. Each node corresponds to an object in the image. In related tasks in the multimodal field, nodes generally carry node features, and the representation matrix of all nodes in the node set is $X \in \mathbb{R}^{n \times d}$. In addition to the nodes, each edge in the edge set is represented as $e_{i,j} = (v_i, v_j) \in E$. If edge features are required, the edge feature matrix is $X^{e} \in \mathbb{R}^{m \times c}$, where the feature of the edge between the $i$-th and $j$-th objects is a $c$-dimensional vector $X^{e}_{i,j} \in \mathbb{R}^{c}$. Since an edge represents the relationship between two objects, it can be expressed formally as <subject-relation-object>, where the subject corresponds to $v_i$ and the object corresponds to $v_j$. The neighbors of a node $v$ can be expressed as $\mathcal{N}(v) = \{u \in V \mid (v, u) \in E\}$.

One approach to embedding relational information into the edges is to classify the positional relation and assign it as a label to each edge. Yao et al. [72] discretized the positional relationship based on the geometric features between two objects' boxes and assigned a category to each edge to build a directed graph. Specifically, according to the positional relationship between the two object boxes, the edges are divided into 11 categories, as shown in Figure 4. Categories 1 and 2 are the inclusion and included relationships between the subject and the object, respectively. Category 3 is the overlapping relationship between two objects whose IoU is greater than or equal to 0.5. The remaining cases are divided into 8 categories according to the relative angle between the center points, representing 8 different positions.
After the positional relationship is classified into one of these categories, the corresponding label is assigned to each edge to construct the graph. An example of the graph structure is shown in Figure 5(a); it is a directed fully connected graph. The feature corresponding to each edge is a specific category of positional relationship.
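The edge-labeling rule above can be sketched as a small Python function. Several details below are our own assumptions and are not specified in the text: boxes are given in corner format $(x_{min}, y_{min}, x_{max}, y_{max})$ with positive area, category 1 is taken to mean that the subject lies inside the object and category 2 the reverse, and the eight angular categories are numbered 4 to 11 in 45-degree sectors starting from the positive x-axis.

```python
import math

def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def contains(outer, inner):
    """True if `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def spatial_category(subj, obj):
    """Return a label in 1..11 for the directed edge subj -> obj."""
    if contains(obj, subj):
        return 1                      # subject inside object (assumed order)
    if contains(subj, obj):
        return 2                      # subject covers object (assumed order)
    if iou(subj, obj) >= 0.5:
        return 3                      # strongly overlapping boxes
    sx, sy = (subj[0] + subj[2]) / 2, (subj[1] + subj[3]) / 2
    ox, oy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    angle = math.degrees(math.atan2(oy - sy, ox - sx)) % 360.0
    return 4 + int(angle // 45.0) % 8  # eight 45-degree sectors (assumed)
```

Applying `spatial_category` to every ordered pair of detected boxes yields the labels of the directed fully connected graph; since the rule is asymmetric, the edges $i \to j$ and $j \to i$ generally receive different labels.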
In summary, the graph-based approach can naturally utilize the adjacency matrix to characterize the relationships between objects. The graph is more interpretable and controllable than the tensor method. The tensor method is equivalent to processing an undirected fully connected graph when it uses full attention for subsequent learning. However, the relational content represented by each edge in the graph still depends on a small number of spatial categories, which results in poor performance in representing complex relational words in sentences.

Motion Relationship.
The action relationship between objects is more specific than the positional relationship and reflects the relationship at a higher semantic level. According to the data structure, the motion relation can also be divided into two forms: (1) tensor and (2) graph. The first method is more intuitive. The complexity of the motion relation makes it difficult to represent with geometric features. Therefore, many studies [73,74,78,79,80,81] directly mine the information from the image content, extract the features of the relevant image regions, and represent them in the form of a tensor. The second method uses models pretrained on upstream tasks to generate a suitable graph.

Semantic Tensor.
Given an image and its $N$ objects, the motion relation is represented in the form of an $N \times N \times d$ tensor. Specifically, for the action relationship between object $i$ and object $j$, the tensor-based method extracts the union content of the two objects in the image to represent the corresponding relationship. The extracted image area must contain both objects' bounding boxes to ensure that the extracted content contains an accurate action relationship while avoiding other noise as much as possible. The image region from which Zhang et al. [82] extracted features is the minimum enclosing rectangle of the two object boxes, as shown in Figure 5. Specifically, for the coordinates $(x_i, y_i, w_i, h_i)$ of object $i$ and the coordinates $(x_j, y_j, w_j, h_j)$ of object $j$, the union box is the smallest axis-aligned box that encloses both boxes. The union image area is passed through the pretrained convolutional network to obtain the corresponding features. Each image thus yields an $N \times N \times d$ relation tensor for different downstream tasks.
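The union region can be computed directly from the two boxes, as in the following sketch. It assumes, as our own convention, that each box is $(x, y, w, h)$ with $(x, y)$ the box center; if $(x, y)$ denotes the top-left corner instead, only the conversion to corners changes.

```python
def union_box(box_i, box_j):
    """Smallest axis-aligned box enclosing both inputs.

    Boxes are (x, y, w, h) with (x, y) the centre (an assumed convention);
    the result uses the same format.
    """
    def corners(box):
        x, y, w, h = box
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2

    ax1, ay1, ax2, ay2 = corners(box_i)
    bx1, by1, bx2, by2 = corners(box_j)
    x1, y1 = min(ax1, bx1), min(ay1, by1)   # top-left of the union
    x2, y2 = max(ax2, bx2), max(ay2, by2)   # bottom-right of the union
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1

# Usage: boxes spanning (1, 1)-(3, 3) and (4, 2)-(6, 6).
u = union_box((2, 2, 2, 2), (5, 4, 2, 4))
```

The union box is symmetric in its two arguments, so only one region feature needs to be extracted for each unordered pair of objects.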
In summary, the tensor-based method stores the image features that characterize each relational region in relational tensors for the subsequent learning of relational information. This method is relatively straightforward, but it inevitably introduces noise. The noise here refers to relational information that is irrelevant to the relations contained in the generated sentence. At the same time, object detection generally returns many objects. In the image description task, directly calculating all $N \times N$ relational features brings a high computational cost. In terms of model performance, the quality of the generated sentences is determined by the extracted features, which in turn depend on the structure of the pretrained convolutional network and its training objectives in upstream tasks. As a result, researchers need to spend more effort on additional tasks. Moreover, once the additional pretrained network is taken into account, the caption model as a whole is more computationally intensive.

Semantic Graph.
The graph method uses relationship detection networks pretrained on visual relation detection to extract the action relations between objects and construct the corresponding scene graphs. Specifically, Yao et al. [72] used this method to build the graph, as shown in Figure 5. The pretrained model predicts the action relationship and uses the relationship category as the edge label. In each relational tuple <subject-predicate-object>, the subject and object are represented by the 2048-dimensional attribute features from the RoI pooling of the object detection network. The feature of the predicate is initialized with the image region feature of the minimum enclosing rectangle of the two bounding boxes belonging to the subject and the object. These features are concatenated and then input to the subsequent classification layer to obtain the relationship category of the predicate. The $N \times (N - 1)$ relational tuples (excluding self-relations) are input into the relational classification network. Edges with a probability larger than 0.5 are kept to form an action graph, as shown in Figure 6(b).

Yang et al. [73] constructed scene graphs from the reference sentences in the training phase and reconstructed the sentences from them to accomplish auto-encoding training. The scene graph divides its nodes into three categories: object nodes, relational nodes, and attribute nodes. For each <subject-predicate-object> tuple, the subject and object correspond to the object nodes $o_i$ and $o_j$. The $l$-th attribute of object $i$ corresponds to the attribute node $a_{i,l}$, and the relationship between the two objects $i$ and $j$ corresponds to the relationship node $r_{ij}$. Each node in the scene graph is represented by a feature vector $e_o, e_a, e_r \in \mathbb{R}^{d}$, respectively. The object node $o_i$ is connected to each of its attribute nodes $a_{i,l}$ by an edge directed from the object node to the attribute node.
If there is a relationship node, the subject node $o_i$ first connects to the relationship node $r_{ij}$, and then the relationship node $r_{ij}$ connects to the object node $o_j$. The constructed graph is shown in Figure 6(c). In terms of implementation, they adopt the scene graph constructor used in [83], which first converts sentences into syntactic dependency trees and then converts the trees into scene graphs according to the rules mentioned in [75].
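The node and edge layout of such a scene graph can be sketched with plain Python lists. The function name, the tuple-based node representation, and the example labels are hypothetical and serve only to illustrate the connection rules (object to attribute, and subject to relationship to object); the sketch does not reproduce the parser used in [83].

```python
def build_scene_graph(objects, attributes, relations):
    """Build node and directed-edge lists for a sentence-level scene graph.

    objects:    list of object labels, e.g. ["man", "horse"]
    attributes: list of (object label, attribute label) pairs
    relations:  list of (subject label, predicate label, object label) triples
    Returns (nodes, edges): nodes are (kind, label); edges are index pairs.
    """
    nodes, edges, index = [], [], {}
    for name in objects:
        index[name] = len(nodes)
        nodes.append(("object", name))
    for obj, attr in attributes:
        a = len(nodes)
        nodes.append(("attribute", attr))
        edges.append((index[obj], a))          # object -> attribute
    for subj, pred, obj in relations:
        r = len(nodes)
        nodes.append(("relation", pred))
        edges.append((index[subj], r))         # subject -> relationship
        edges.append((r, index[obj]))          # relationship -> object
    return nodes, edges

# Usage with a hypothetical caption "a man rides a brown horse".
nodes, edges = build_scene_graph(
    ["man", "horse"], [("horse", "brown")], [("man", "ride", "horse")]
)
```

Treating relationships as nodes rather than edge labels lets each relationship carry its own feature vector, at the cost of two edges per tuple.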
Chen et al. [74] designed a customized captioning model to generate sentences according to an abstract graph. The abstract graph is a scene graph customized according to the user's wishes. The different forms of the graph determine the level of detail in the generated caption. Specifically, the abstract graph is constructed from three types of nodes: (1) object nodes, (2) attribute nodes (representing a specific attribute of an object node), and (3) relationship nodes. The abstract graph is constructed by adding nodes and edges according to the user's interests. Specifically, given all $N$ object boxes of an image, if the user wants to know the content of the $i$-th object box, the object node $o_i$ is added to the abstract graph. If the user also wants to know about the attributes of the object node $o_i$, $l$ attribute nodes are added, and each attribute node $a_{i,l}$ is connected by a directed edge from $o_i$ to $a_{i,l}$. If the user wants to describe the relationship between two objects, the corresponding relationship node $r_{i,j}$ is added to the abstract graph, and edges are built between the subject and the object: the subject node $o_i$ points to the relationship node $r_{i,j}$, and then the relationship node $r_{i,j}$ points to the object node $o_j$. The features of the object nodes and attribute nodes in the abstract graph are the visual features of the corresponding object bounding box. The feature of a relationship node is extracted from the union box of the two objects. The result of the construction is shown in Figure 6(d).
In summary, the graph method represents more complex action relationships between objects than the tensor method. At the same time, some unnecessary relationship information is eliminated, so the important relationship content is better retained. There is also a more significant improvement in computational cost and model performance. The disadvantage is that it depends on the effectiveness of the relationship detection network and relies on training with additional relationship information, which increases the complexity of the entire process. In the geometric graph, each edge represents a certain orientation, whereas in the semantic graph, each edge directly corresponds to a relational category. This more detailed representation of the relationship makes the semantic graph more effective at modeling the alignment of relational words. However, the limited number of relational categories also limits the variety of generated relational words. At the same time, the semantic similarity between different categories is lost due to the classification operation.

Relational Encoding.
Depending on the type of relational data structure, the encoding methods can be divided into two groups: (1) tensor-based methods and (2) graph-based methods. This section focuses on the different relational encoding methods used in relational captioning.

Tensor-Based Method.
The tensor-based method is adopted when the positional relation information or the action relation information is extracted as a relation feature tensor. In this case, each image corresponds to a relational feature tensor of size N × N × d. For a geometric feature tensor between objects, d is 4. For a relational feature tensor extracted from the action information between objects, d depends on the dimension of the model.
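As a rough illustration of the N × N × 4 geometric tensor, the sketch below builds one entry per ordered pair of boxes. The particular four features (log-scaled center offsets and log size ratios) are an assumption made for illustration, since this section only states that d is 4; the function name `geometric_tensor` and the `eps` guard are likewise hypothetical.

```python
import math


def geometric_tensor(boxes):
    """Return an N x N x 4 nested list of relative geometry features.

    boxes: list of (center_x, center_y, width, height) tuples.
    The four features are an assumed choice, not prescribed by the text.
    """
    eps = 1e-3  # avoids log(0) when two box centers coincide
    tensor = []
    for (xi, yi, wi, hi) in boxes:
        row = []
        for (xj, yj, wj, hj) in boxes:
            row.append([
                math.log(max(abs(xi - xj), eps) / wi),  # horizontal offset
                math.log(max(abs(yi - yj), eps) / hi),  # vertical offset
                math.log(wj / wi),                      # width ratio
                math.log(hj / hi),                      # height ratio
            ])
        tensor.append(row)
    return tensor


boxes = [(10.0, 10.0, 4.0, 4.0), (20.0, 10.0, 8.0, 2.0)]
G = geometric_tensor(boxes)   # G[i][j] is the 4-d vector for the pair (i, j)
```

With two boxes, `G` has shape 2 × 2 × 4, and the entry for a box paired with itself has zero size ratios.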

Geometric Multiplier.
For the geometric tensor, Herdade et al. [62] used the tensor as a multiplier to adjust the attention weights in the self-attention of the encoder. As described in Section 2, the weight calculation in the self-attention operator relies on the similarity between the query vector and the key vector. The geometric tensor, which carries prior information about the positional relationship between objects, is used to adjust each weight element in the self-attention operator. Herdade et al. [62] compute a geometric weight for each pair of objects as $\omega^{G}_{i,j} = \mathrm{ReLU}(\mathrm{Emb}(\lambda(i,j))\,W_G)$, where $\lambda(i,j)$ represents the (i, j)-th vector in the geometric tensor. Emb is an embedding layer, which first maps the 4-dimensional geometric vector to a high-dimensional feature space and then encodes each element's positional information through sinusoidal position encoding. Finally, the d-dimensional vector is projected to a scalar factor through $W_G$, and negative values are filtered out by the ReLU activation function. Note that each element of the attention weight E in the self-attention operator describes the similarity of the i-th and j-th objects, so E is indexed in the same way as the geometric tensor (which describes the positional information of the i-th and j-th objects). As a result, $\omega^{G}_{i,j}$ is taken as the scaling factor for the element of E with the same indexes i and j. Figure 7(a) shows the framework.
The geometric multiplier is designed to modulate the attention weight between each pair of objects in order to introduce prior positional knowledge. Each value of the conventional attention weight E reflects the similarity between the i-th and j-th objects. Because the two have the same shape, each value of the geometric tensor is applied to the value with the same index in the attention weight. It is an ingenious and convenient way to introduce positional information into interactive learning. However, its effect on generating better sentences is unclear and cannot be controlled directly.
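The multiplier idea can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation of Herdade et al. [62]: it assumes the geometric weight scales the exponentiated similarity before each row is normalized, and the function names and the `floor` parameter are hypothetical.

```python
import math


def geometric_weight(embedded_geometry, w_g):
    """Project an embedded geometry vector to a non-negative scalar (ReLU)."""
    projection = sum(e * w for e, w in zip(embedded_geometry, w_g))
    return max(0.0, projection)


def geometry_scaled_attention(scores, geo_weights, floor=1e-6):
    """Scale exp(score) by the geometric weight, then normalize each row.

    scores[i][j]      : appearance similarity between objects i and j
    geo_weights[i][j] : non-negative geometric weight for the same pair
    floor             : assumed lower bound that keeps every row sum positive
    """
    attention = []
    for score_row, geo_row in zip(scores, geo_weights):
        peak = max(score_row)  # subtract the row maximum for numerical stability
        scaled = [max(g, floor) * math.exp(s - peak)
                  for s, g in zip(score_row, geo_row)]
        total = sum(scaled)
        attention.append([value / total for value in scaled])
    return attention


scores = [[0.0, 0.0], [0.0, 0.0]]                    # identical appearance similarity
geo = [[geometric_weight([1.0, 0.0], [1.0, 1.0]),    # geometric weight 1.0
        geometric_weight([2.0, 1.0], [1.0, 1.0])],   # geometric weight 3.0
       [1.0, 1.0]]
attention = geometry_scaled_attention(scores, geo)
```

Because the appearance scores are identical here, the first row of `attention` is driven entirely by geometry: the pair with the larger geometric weight receives three times the attention of the other.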

Geometric Bias.
In addition to scaling the similarity between the i-th and the j-th object in the weight matrix, Guo et al. [39] adopted a bias-based method to adjust the attention weights. Specifically, the geometric tensor passes through a series of functions and is added to the original weight matrix as a bias. Guo et al. [39] designed three functions for three types of geometric bias: (1) content-independent geometric bias, (2) query-dependent geometric bias, and (3) key-dependent geometric bias. The content-independent geometric bias is derived from the geometric tensor alone and is independent of the visual content. The geometric tensor is transformed into a scalar through a learnable parameter $w_g^{T}$, filtered by the ReLU nonlinearity, and then added directly to the weight in the self-attention operator, as shown in Figure 7(b). Unlike the content-independent bias, the query-dependent and key-dependent geometric biases take a further step and compute the similarity between the geometric feature and the visual query or key, as shown in Figure 7(c).
In contrast to the previous method, Luo et al. [83] used both an absolute position geometric tensor and a relative position geometric tensor. The absolute position geometric tensor is added directly to the query vector and key vector as a position feature vector, and the relative position geometric tensor is added as a bias to the attention weight E, as shown in Figure 7(d). Here, $\mathrm{pos}_{*}$ is the absolute position geometry tensor corresponding to each element in the query vector or key vector, and Ω is the relative position geometry tensor. Like the multiplier method, the bias method lets each element of the geometric tensor act on the element of the attention weight at the same position. This method is straightforward and effective but less interpretable.
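The additive variant can be sketched in the same spirit. The code below is an illustration only: it assumes the bias is added to the raw similarity before a row-wise softmax, and that the geometry vector has already been embedded to the dimension of the query. All function names are hypothetical and do not come from Guo et al. [39] or Luo et al. [83].

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def content_independent_bias(geo_vec, w_g):
    """ReLU(w_g . geo_vec): depends on the geometry only."""
    return max(0.0, sum(w * g for w, g in zip(w_g, geo_vec)))


def query_dependent_bias(query, geo_vec):
    """Dot product between a visual query and an embedded geometry vector."""
    return sum(q * g for q, g in zip(query, geo_vec))


def biased_attention(scores, bias):
    """Add the geometric bias to the raw scores, then apply a row-wise softmax."""
    return [softmax([s + b for s, b in zip(score_row, bias_row)])
            for score_row, bias_row in zip(scores, bias)]


w_g = [1.0, 1.0]
bias = [[content_independent_bias([0.0, 0.0], w_g),
         content_independent_bias([math.log(3.0), 0.0], w_g)]]
attention = biased_attention([[0.0, 0.0]], bias)
```

A key-dependent bias would follow the same pattern as `query_dependent_bias`, with the key vector in place of the query.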

Graph-Based Methods.
The graph-based method is specific to processing graph data. Graph-structured data filter out some unreasonable relationships through the prior knowledge learned by the pretrained model.

Label-Aware GCN.
Yao et al. [72] designed a graph convolutional network that exploits the knowledge carried by each labeled edge and its direction (Figure 8). Each node considers all of its connected labeled edges in order to fuse the relational labels with its connected nodes. Specifically, each image is transformed into a semantic graph and a positional graph, which represent the action relations and the positional relations, respectively. The semantic graph is directed, and its edges are labeled with the action relationship. The positional graph is an undirected graph with labeled edges. To make the graph convolutional network aware of each edge's label and direction, each layer uses a transformation matrix $W_{\mathrm{dir}(v_i,v_j)}$ that is selected according to the type of each edge. Specifically, if the i-th object $v_i$ is the subject in a relation tuple <subject-relation-object>, then the transformation matrix is $W_1$; if $v_i$ is the object, then the transformation matrix becomes $W_2$. Similarly, for a self-connected edge, the transformation matrix is set to $W_3$. $\mathrm{lab}(v_i,v_j)$ represents the category of the edge, and $g_{v_i,v_j}$ is a weight function that determines the importance of the edge in the calculation. Compared with the conventional GCN, the label-aware GCN introduces relationship information on each edge through the corresponding relational label. The label triggers the embedding function to form the edge features, which further fuse the relational information of the connected nodes. With the graph, the connections between nodes determine the interactive learning and guide the model to generate content about the corresponding objects. It is more explainable than the geometric methods, which use a fully connected graph.
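A single label-aware layer of this kind can be sketched as follows. This is a sketch under stated assumptions rather than the layer of Yao et al. [72]: it assumes each message is the direction-specific transformation of the neighbor feature plus a label-specific bias, weighted by the gate and passed through a ReLU. The bias table `b_lab`, the constant default `gate`, and the requirement that the matrices are square are all illustrative choices.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def label_aware_gcn_layer(features, triples, w_dir, b_lab, gate=lambda i, j: 1.0):
    """One hypothetical label-aware graph convolution layer.

    features : node name -> feature vector
    triples  : list of (subject, label, object) relation tuples
    w_dir    : {"subject": W1, "object": W2, "self": W3}, square matrices
    b_lab    : edge label -> bias vector (must include a "self" entry)
    gate     : stand-in for the edge weight function g, returns a scalar
    """
    # Collect, for every node v_i, its messages as (neighbor, direction, label).
    incoming = {node: [(node, "self", "self")] for node in features}
    for subj, label, obj in triples:
        incoming[subj].append((obj, "subject", label))   # v_i is the subject
        incoming[obj].append((subj, "object", label))    # v_i is the object

    updated = {}
    for node, messages in incoming.items():
        total = [0.0] * len(features[node])
        for neighbor, direction, label in messages:
            transformed = matvec(w_dir[direction], features[neighbor])
            weight = gate(node, neighbor)
            for k, value in enumerate(transformed):
                total[k] += weight * (value + b_lab[label][k])
        updated[node] = [max(0.0, t) for t in total]      # ReLU activation
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
features = {"man": [1.0, 0.0], "horse": [0.0, 1.0]}
out = label_aware_gcn_layer(
    features,
    triples=[("man", "riding", "horse")],
    w_dir={"subject": identity, "object": double, "self": identity},
    b_lab={"riding": [0.5, 0.5], "self": [0.0, 0.0]},
)
```

The subject and the object of the same tuple are updated with different matrices, so the layer can tell "man riding horse" apart from "horse riding man" even though both share one edge.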

Scene Graph Auto-Encoder.
Yang et al. [73] proposed the Scene Graph Auto-Encoder (SGAE) model, which learns a re-encoder that refines the original visual features by reconstructing the sentence during training. The scene graph is constructed from the ground-truth sentence, and each visual feature is further fused with other features according to the connections in the graph. The graph is shown in Figure 6(c) and includes object nodes, relational nodes, and attribute nodes.
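One way to picture such a re-encoder is as attention over a learned dictionary: the input feature is compared with every dictionary atom, and the refined feature is the resulting softmax-weighted mix of atoms. The sketch below follows that reading as an assumption; the function name `reencode` and the toy dictionary are hypothetical, and a real model would learn the dictionary during sentence reconstruction.

```python
import math


def reencode(x, dictionary):
    """Rebuild x as a softmax-weighted mix of dictionary atoms.

    x          : d-dimensional node feature
    dictionary : list of atoms, each d-dimensional (the columns of a matrix D)
    """
    scores = [sum(a * v for a, v in zip(atom, x)) for atom in dictionary]
    peak = max(scores)                       # for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * atom[k] for w, atom in zip(weights, dictionary))
            for k in range(len(x))]


atoms = [[1.0, 0.0], [0.0, 1.0]]
refined = reencode([0.0, 0.0], atoms)    # equal scores, so an even mix of both atoms
focused = reencode([10.0, 0.0], atoms)   # strongly aligned with the first atom
```

The design intent of such a dictionary is that knowledge accumulated while reconstructing sentences can be reused when the input is a visual feature.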
The node embeddings are computed as $x_{r_{ij}} = g_r(e_{o_i}, e_{r_{ij}}, e_{o_j})$ for relation nodes and $x_{a_i} = \frac{1}{N}\sum_{l=1}^{N} g_a(e_{o_i}, e_{a_{il}})$ for attributes, where $x_{r_{ij}}$ is the node feature of the relation node $r_{ij}$, and its neighbor node features $e_{o_i}$, $e_{r_{ij}}$, and $e_{o_j}$ belong to the corresponding nodes in the relation tuple <$o_i$-$r_{ij}$-$o_j$>. $x_{a_i}$ represents the attribute information of the i-th object node, and its neighbors $e_{o_i}$ and $e_{a_{il}}$ belong to the object node i and its l-th attribute. An object may have multiple attributes, each of which corresponds to an attribute node, and N is the total number of attributes. $x_{o_i}$ represents the feature of the i-th object node, which is aggregated over <$o_i$-$r_{i*}$-$o_*$>, all the tuples in which the i-th object is the subject, and <$o_*$-$r_{*i}$-$o_i$>, all the tuples in which the i-th node is the object. After this embedding, they use a memory-network-style dictionary matrix $D \in \mathbb{R}^{d \times V}$ to refine the input node feature x. The refined feature is fed into the subsequent decoder to regenerate the sentence, which is compared with the real input sentence. The error is propagated back through the network for auto-encoding training. The auto-encoder method uses reconstruction to learn semantic knowledge: it starts from the sentence and regenerates it. This semantic knowledge is reflected in the scene graph and assists the inference process. The whole framework is shown in Figure 8.

Multirelational GCN.
Chen et al. [74] proposed a customized abstract graph to generate specific captions. To represent each node, the object nodes and attribute nodes adopt the visual features of the corresponding object bounding box, which are obtained from the object detection network, and the feature of the union bounding box of two objects is used for the relational node. In addition, Chen et al. assign different transformation matrices to the different types of nodes in the feature embedding to further distinguish them.
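The type-specific embedding just described can be sketched as follows. The sketch assumes, purely for illustration, that each node type has its own role vector applied elementwise to the visual feature and that attribute nodes additionally receive an order-dependent offset; the text itself only states that each node type has its own transformation. The function name `embed_node` and the toy values are hypothetical.

```python
def embed_node(feature, node_type, role_vectors, position=None):
    """Modulate a visual feature by the role vector of its node type.

    feature      : visual feature of the node's bounding box (or union box)
    node_type    : "object", "attribute", or "relationship"
    role_vectors : node type -> role vector (assumed elementwise modulation)
    position     : optional order embedding, used only for attribute nodes
    """
    role = list(role_vectors[node_type])
    if node_type == "attribute" and position is not None:
        role = [r + p for r, p in zip(role, position)]   # add order information
    return [f * r for f, r in zip(feature, role)]


roles = {
    "object": [1.0, 1.0],
    "attribute": [0.5, 0.5],
    "relationship": [2.0, 2.0],
}
box_feature = [2.0, 3.0]
as_object = embed_node(box_feature, "object", roles)
as_attribute = embed_node(box_feature, "attribute", roles, position=[0.5, 0.0])
as_relation = embed_node(box_feature, "relationship", roles)
```

The point of the example is that an object node and its attribute nodes start from the same box feature, so some type-dependent term is needed to tell them apart.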
In this embedding, $W_r[k]$ is the transformation matrix, whose three components correspond to the three types of nodes, and pos[i] adds order information for the different attribute nodes $a_{i,l}$. With this embedding, the features of each node in the abstract graph are fused with those of their adjacent nodes. Chen et al. [74] designed a multirelational GCN (Figure 8) so that the graph convolution learns different sets of parameters according to the edge types. There are six different types of edges: (1) object node to attribute node, (2) subject node to relational node, and (3) relational node to object node, together with their inverse edges. Adding the inverse edges converts the directed abstract graph into an undirected graph that fits the GCN, and this graph is fed into the multirelational GCN to refine each node's feature. Different transformation matrices in each layer of the graph convolutional network are used to map the edges of different categories. Within each layer l of the graph convolutional network, the parameters are shared among edges of the same class. By stacking encoder layers, each node feature is learned according to the connections between the nodes in the graph. The multirelational GCN is based on the abstract graph, which the user designs in order to generate a customized caption. Controllability is thereby improved, since the abstract graph determines which attribute, object, and relationship features are fed into the model.
Table 1 summarizes the relational feature construction and relational encoding used by current methods in relational captioning.

Visual Genome.
The Visual Genome dataset contains 108K images in total, with many object annotations, attribute annotations, and relationship annotations between objects for tasks such as object detection and visual relationship detection.
In relational captioning, this dataset is mainly used to pretrain the object detection network or the visual relationship detection network. In the pretraining stage, the training, validation, and test split follows Anderson et al. [5]. Specifically, 98K images are used for training, and the remaining 10K images are divided into validation and test sets. When Yao et al. [72] pretrained the object detection network, the dataset was filtered to retain 1600 object categories and 400 attribute categories. For the relationship annotations, the top 50 most common action relationships are selected and manually grouped into 20 categories.

MSCOCO.
MSCOCO [85] was developed by a Microsoft team with the goal of scene understanding. It captures images of complex scenes and supports multiple tasks such as image recognition, segmentation, and captioning. The dataset uses Amazon's "Mechanical Turk" service to manually write at least five sentences for each image and contains more than 1.5 million sentences. The training set contains 82,783 images, the validation set contains 40,504 images, and the test set contains 40,775 images. In captioning tasks, the "Karpathy" split [5] is the standard data split, which takes 5000 images from the validation set for evaluation and another 5000 images for testing. The remaining training and validation images are used for training.

Flickr8K/Flickr30k.
The Flickr8k [86] images come from Yahoo's photo-sharing website Flickr. The dataset includes 8,000 images: 6,000 for training, 1,000 for evaluation, and 1,000 for testing. Flickr30k [87] contains 31,783 images collected from the Flickr website, mainly depicting people engaged in activities. Each image is again manually labeled with five sentences.

PASCAL 1K.
PASCAL 1K is a subset of the well-known PASCAL VOC challenge image dataset [7], which provides a standard image annotation dataset and a standard evaluation system. The PASCAL VOC dataset consists of 20 categories. Amazon's Mechanical Turk service was then used to manually label each image with five descriptions. The dataset has excellent image quality and complete annotations, which makes it suitable for testing algorithm performance.

Evaluation.
The evaluation of relational captioning follows the standard evaluation used in natural language processing, which measures the similarity between the generated sentence and the ground-truth sentence. The evaluation metrics are BLEU [88], METEOR [89], ROUGE [90], CIDEr [91], and SPICE [92]. Of these five metrics, BLEU and METEOR were designed for machine translation, ROUGE for automatic summarization, and CIDEr and SPICE for image captioning. In principle, these evaluation metrics mainly measure the n-gram consistency between generated sentences and reference sentences and can also be affected by the importance and rarity of n-grams in the corpus.
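The notion of n-gram consistency can be made concrete with a clipped n-gram precision, the basic quantity behind BLEU-style scores. The sketch below is a simplified illustration with a single reference and whitespace tokenization; it leaves out the brevity penalty and the combination of several n-gram orders, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is counted at most as often as it appears
    in the reference (clipping).
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


candidate = "a man rides a horse".split()
reference = "a man is riding a horse".split()
unigram = clipped_precision(candidate, reference, 1)   # 4 of 5 unigrams match
bigram = clipped_precision(candidate, reference, 2)    # 2 of 4 bigrams match
```

The drop from unigram to bigram precision shows why higher-order n-grams are stricter: "rides" versus "is riding" breaks every bigram that touches the verb.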

BLEU.
As a widely used and essential evaluation metric in machine translation, BLEU [88] mainly measures the degree of overlap between the generated sentence and the reference sentence. The number of n-grams shared by the generated and reference sentences determines the BLEU score: the more n-grams are shared, the higher the BLEU score, meaning the generated sentences are closer to the reference sentences. As n increases, BLEU no longer considers only matches between individual words but favors matches between longer spans of content. The higher the BLEU score, the better the generated sentences.

METEOR.
METEOR [89] mainly considers the influence of synonyms and word forms when comparing generated sentences with all reference sentences. To evaluate the fluency of the sentence, METEOR is computed based on chunks, which are formed from consecutive matched words. The word consistency between the candidate and reference sentences is measured through these chunks. METEOR is calculated by combining the precision, recall, and F-value over the different kinds of matches. The higher the METEOR score, the better the sentence.

ROUGE.
ROUGE [90] is a set of evaluation metrics designed to evaluate text summarization. ROUGE-L is the variant used in relational captioning. It is calculated from the longest common subsequence between the generated and reference sentences, combining the recall and precision of the longest common subsequence into a single score. The higher the ROUGE score, the better the sentence.

CIDEr.
CIDEr [91] is an evaluation metric specially designed for captioning. It measures the consistency of image annotations by computing a term frequency-inverse document frequency (TF-IDF) weight for each n-gram. This metric treats each sentence as a "document," represented as a TF-IDF vector, and then computes the cosine similarity between the generated sentence and the reference sentence. It makes up for a shortcoming of BLEU, in which all matched words are treated equally, because it also considers the importance of the information carried by each word. Likewise, the higher the CIDEr score, the better the performance.

Table 1: Summary of the various methods in relational captioning.

| Methods | Feature construction | Relational encoding | Decoder |
| --- | --- | --- | --- |
| GCN-LSTM [72] | Positional relation: directed graph with label; motional relation: directed scene graph | Graph convolutional network | Two-LSTMs decoder |
| SGAE [73] | Motional relation: directed scene graph | Auto-encoder | Two-LSTMs decoder |
| ORT [62] | Positional relation: directed graph with label | Attention multiplier | Transformer |
| NG-SAN [39] | Positional relation: directed graph with label | Attention bias | Transformer |
| DLCT [83] | Positional relation: directed graph with label | Attention bias | Transformer |

SPICE.
SPICE [92] is a semantic evaluation metric for image captions, which measures how effectively image captions recover objects, attributes, and the relationships between them. On image captioning datasets, SPICE captures human judgments of model captions better than existing n-gram metrics. Table 2 ranks the models currently used in relational image captioning by their scores on the MSCOCO dataset.
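Among the metrics above, ROUGE-L is simple enough to sketch directly, since it rests on the longest common subsequence (LCS) between the two sentences. The sketch below reports the score as an F-measure of LCS precision and recall, which is the usual convention; the single reference, the whitespace tokenization, and the default `beta=1.0` are assumptions made for illustration, and evaluation toolkits may use a different weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference, beta=1.0):
    """F-measure of LCS precision and recall; beta weights recall over precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


candidate = "a man rides a horse".split()
reference = "a man is riding a horse".split()
shared = lcs_length(candidate, reference)   # "a man a horse"
score = rouge_l(candidate, reference)
```

Unlike n-gram matching, the subsequence does not need to be contiguous, so the words around the mismatched verb still count as long as their order is preserved.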

Conclusion
This paper mainly summarizes the procedure of relational captioning and the development of each of its parts in recent years. Relational captioning focuses on the relationships between objects in the image. By introducing and incorporating relationship information, the sentences generated by the model achieve better adequacy and accuracy. We summarize the framework used in relational captioning and divide the relational procedure into two parts: feature construction and feature encoding. According to the characteristics of the relationships between objects, the relationship is further divided into the positional relationship and the action relationship. The methods used for learning each relationship are discussed for the feature construction and encoding stages. In addition, we summarize the datasets commonly used in relational captioning and the related evaluation metrics.
We conclude by summarizing the current challenges in relational captioning and clarifying our vision for this area. There are two main challenges in relational captioning, which lie in feature construction and feature encoding. First, in terms of feature construction, it is challenging to find an appropriate method that considers as many relationship categories as possible while preserving the content correlation between relationship categories in the textual modality. Second, in terms of feature encoding, it is challenging to make the features perceive the semantic differences among various kinds of relational information while maintaining their original visual knowledge. Given these two challenges, we believe that future work has the following room for improvement in relational captioning: (1) The feature construction of positional relationships is mainly limited to handcrafted geometric features extracted from objects' bounding boxes in 2-dimensional space. Such geometric features are sensitive to the scale of the object box. (2) The features of motional relationships depend on the performance of the pretrained feature extraction network. Better features can be obtained by adjusting the training objectives of the pretrained network in upstream tasks. (3) For feature encoding, the current cross-entropy or reinforcement learning training objectives make it difficult for the features output by the encoder to fully reflect the differences between relationship categories while retaining visual knowledge. Compared with end-to-end training, the pretraining-finetuning method [67][68][69] could use specialized objective functions to obtain more powerful features. (4) The alignment between relational features and relational vocabulary is ambiguous. The generation of relational vocabulary mainly depends on the global image information instead of the relational features.

Data Availability
This paper is an overview paper in which the data reported are derived from the corresponding published research studies. These prior studies (and datasets) are cited at relevant places within the text as references.

Conflicts of Interest
The authors declare that they have no conflicts of interest.